A step-by-step guide for constructing an SQLite database using Airbnb data.
For this project, we demonstrate how to separate, clean and upload host and listing data taken from Airbnb for the city of Bangkok in Thailand.
The following libraries are used for data cleaning and database construction
Begin by loading the downloaded data into R using the read_csv() command from the readr library.
data <- read_csv("../data/listings.csv.gz")
We also need to form a database connection object:
Before moving on to the data processing stage, we first define a few useful functions.
remove_live_database() functionWhile we are actively working on our database construction, we may find it useful to periodically disconnect and rebuild our database from scratch. Having to close R, delete the airdb.SQLite file and reopen our session repeatedly can be tedious. To work around this, we use the following remove_live_database() function to disconnect and delete any current working database while R is still open:
remove_live_database <- function(con){
if(file.exists("airdb.SQLite")){
if (exists("con")){
dbDisconnect(con)
}
file.remove("airdb.SQLite")
}
}
insert_to_sql() functionIn order to facilitate the insertion of data into our RSQLite database, we the following insert_to_sql() function. This function performs a number of operations prior to insertion:
data argument.NA out values are in the correct form. For all non NA values, replace the double quotes with single quotes and bookend all strings with double quotes.NA values to NULL and trim all additional whitespace.\\’s. Finally, use the paste0() function to convert our string into a query prior to insertion in our database’s corresponding relevant table.insert_to_sql <- function(con, table, data){
# (1)
column_name <- paste(names(data), collapse = ", ")
# (2)
data_new <- data %>%
mutate_if(is.character, function(x) ifelse(is.na(x), NA, x %>%
str_replace_all('"', "'") %>% # Replace " with '
paste0('"', . , '"') # Add " before and after string
)
)
value_data <- apply(data_new, MARGIN = 1,
function(x) x %>%
paste(collapse = ",") %>% # Join all column into single string
str_replace_all("\\bNA\\b", "NULL") %>% # Create NULL from NA
str_trim() # remove unnecessary whitespace
)
# (4)
query_value <- paste(value_data) %>%
paste0("(", ., ")") %>% # Add bracket before and after string
paste(collapse = ", ") %>% # Join all values into single string
str_remove_all("\\\\") %>% # Remove \\ from string
paste0("INSERT INTO ", table, "(", column_name, ") VALUES ", .)
# (5)
dbSendQuery(con, query_value)
}
The data used in for this project is provided by Airbnb and can be found here. To get started, first download the listings.csv.gz from the previous link. In order to help streamline the process of constructing our database, it is beneficial to split the .csv file into a listing table and a host_info table. The E/R (Entity/Relationship) diagram in the next section shows how these two tables are related and the attributes they each hold.
From here, we will work on separating and cleaning the two respective tables prior to insertion in our database.
We can examine our initial data using the glimpse() command from the tibble library.
Rows: 17,040
Columns: 74
$ id <dbl> 27934, 27979, 2~
$ listing_url <chr> "https://www.ai~
$ scrape_id <dbl> 2.021122e+13, 2~
$ last_scraped <date> 2021-12-24, 20~
$ name <chr> "Nice room with~
$ description <chr> "Our cool and c~
$ neighborhood_overview <chr> "It is very cen~
$ picture_url <chr> "https://a0.mus~
$ host_id <dbl> 120437, 120541,~
$ host_url <chr> "https://www.ai~
$ host_name <chr> "Nuttee", "Emy"~
$ host_since <date> 2010-05-08, 20~
$ host_location <chr> "Bangkok", "Ban~
$ host_about <chr> "Hi All, I am n~
$ host_response_time <chr> "N/A", "N/A", "~
$ host_response_rate <chr> "N/A", "N/A", "~
$ host_acceptance_rate <chr> "N/A", "N/A", "~
$ host_is_superhost <lgl> FALSE, FALSE, F~
$ host_thumbnail_url <chr> "https://a0.mus~
$ host_picture_url <chr> "https://a0.mus~
$ host_neighbourhood <chr> "Victory Monume~
$ host_listings_count <dbl> 2, 2, 1, 1, 1, ~
$ host_total_listings_count <dbl> 2, 2, 1, 1, 1, ~
$ host_verifications <chr> "['email', 'pho~
$ host_has_profile_pic <lgl> TRUE, TRUE, TRU~
$ host_identity_verified <lgl> TRUE, FALSE, FA~
$ neighbourhood <chr> "Samsen Nai, Ba~
$ neighbourhood_cleansed <chr> "Ratchathewi", ~
$ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA,~
$ latitude <dbl> 13.75983, 13.66~
$ longitude <dbl> 100.5413, 100.6~
$ property_type <chr> "Entire condomi~
$ room_type <chr> "Entire home/ap~
$ accommodates <dbl> 3, 2, 2, 2, 2, ~
$ bathrooms <lgl> NA, NA, NA, NA,~
$ bathrooms_text <chr> "1.5 baths", "1~
$ bedrooms <dbl> 1, 1, 1, 1, 1, ~
$ beds <dbl> 1, 2, 1, 1, 1, ~
$ amenities <chr> "[\"Wifi\", \"H~
$ price <chr> "$1,845.00", "$~
$ minimum_nights <dbl> 3, 1, 60, 5, 1,~
$ maximum_nights <dbl> 90, 730, 730, 3~
$ minimum_minimum_nights <dbl> 3, 1, 60, 5, 1,~
$ maximum_minimum_nights <dbl> 3, 1, 60, 5, 1,~
$ minimum_maximum_nights <dbl> 90, 730, 730, 3~
$ maximum_maximum_nights <dbl> 90, 730, 730, 3~
$ minimum_nights_avg_ntm <dbl> 3, 1, 60, 5, 1,~
$ maximum_nights_avg_ntm <dbl> 90, 730, 730, 3~
$ calendar_updated <lgl> NA, NA, NA, NA,~
$ has_availability <lgl> TRUE, TRUE, TRU~
$ availability_30 <dbl> 27, 23, 29, 11,~
$ availability_60 <dbl> 57, 53, 59, 11,~
$ availability_90 <dbl> 87, 83, 89, 11,~
$ availability_365 <dbl> 362, 358, 364, ~
$ calendar_last_scraped <date> 2021-12-24, 20~
$ number_of_reviews <dbl> 65, 0, 0, 1, 0,~
$ number_of_reviews_ltm <dbl> 0, 0, 0, 0, 0, ~
$ number_of_reviews_l30d <dbl> 0, 0, 0, 0, 0, ~
$ first_review <date> 2012-04-07, NA~
$ last_review <date> 2020-01-06, NA~
$ review_scores_rating <dbl> 4.85, NA, NA, 4~
$ review_scores_accuracy <dbl> 4.95, NA, NA, 5~
$ review_scores_cleanliness <dbl> 4.81, NA, NA, 3~
$ review_scores_checkin <dbl> 4.97, NA, NA, 2~
$ review_scores_communication <dbl> 4.91, NA, NA, 4~
$ review_scores_location <dbl> 4.66, NA, NA, 4~
$ review_scores_value <dbl> 4.75, NA, NA, 5~
$ license <lgl> NA, NA, NA, NA,~
$ instant_bookable <lgl> FALSE, FALSE, F~
$ calculated_host_listings_count <dbl> 2, 2, 1, 1, 1, ~
$ calculated_host_listings_count_entire_homes <dbl> 2, 1, 0, 0, 0, ~
$ calculated_host_listings_count_private_rooms <dbl> 0, 1, 1, 1, 1, ~
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, ~
$ reviews_per_month <dbl> 0.55, NA, NA, 0~
Before we separate the data into two separate tables, we perform the following preprocessing steps on all entries:
date to the type character to help process NA values.NA representations (blanks, None, N/A, NA) to NA.data.data <- data %>%
# (1) Convert dates to characters for NA values
mutate(last_scraped = as.character(last_scraped),
host_since = as.character(host_since),
calendar_last_scraped = as.character(calendar_last_scraped),
first_review = as.character(first_review),
last_review = as.character(last_review),
) %>%
# (2) Homogenize NA values
#*# Taken from: https://rpubs.com/Argaadya/create_table_sql
mutate_all(function(x) ifelse(x == "" | x == "None" | x == "N/A", NA, x)) %>% #*#
# mutate_all(function(x) ifelse(is.na(x), "NULL", x)) %>%
# (3) Convert character strings back to date type
mutate(last_scraped = as.Date(last_scraped),
host_since = as.Date(host_since),
calendar_last_scraped = as.Date(calendar_last_scraped),
first_review = as.Date(first_review),
last_review = as.Date(last_review))
We are now ready to move on to constructing our host_info table.
Since a host can have many listings, it’s beneficial to split our initial data into two tables, one containing the info related to hosts and the other containing the info related to listings, before inserting into our database. The first table that we create, is the host_info table with the same attributes as shown in the E/R Diagram above.
We are now ready to extract and clean the data from our initial data table in order to construct our host_data table. This is done and the following four steps:
data table. Note that we use the : syntax to grab many columns at once. This syntax is inclusive.distinct() function from the dplyr library.host_since column back to the type character. This is required since RSQLite does not support data of the type date. We will convert this column back to the correct type when performing queries later.str_remove_all() function from the stringr library to convert the host_verifications sublists into simple strings. # (1) Extract host data
host_data <- data %>%
select(host_id:host_identity_verified,
calculated_host_listings_count:calculated_host_listings_count_shared_rooms)
# (2) Remove duplicate values
host_data <- host_data %>% distinct()
# (3) Convert dates
# Note that this will need to converted back to type = date for analysis
host_data <- host_data %>% mutate(host_since = as.character(host_since))
# (4) Clean host verification column
host_data <-
host_data %>%
mutate(host_verifications = str_remove_all(host_verifications, "[\\'\\[\\]]"))
We can now view our clean data:
rmarkdown::paged_table(host_data)
Now that our host_data table is clean. We can create the equivalent table as a query, initially as a string listing the table’s columns, before creating the empty table in our database using our con object:
#################### Create table for host info
query <- "CREATE TABLE host_info(
host_id INT,
host_url VARCHAR(50),
host_name VARCHAR(100),
host_since VARCHAR(50),
host_location VARCHAR(500),
host_about VARCHAR(10000),
host_response_time VARCHAR(50),
host_response_rate VARCHAR(50),
host_acceptance_rate VARCHAR(50),
host_is_superhost BOOLEAN,
host_thumbnail_url VARCHAR(500),
host_picture_url VARCHAR(500),
host_neighbourhood VARCHAR(50),
host_listings_count INT,
host_total_listings_count INT,
host_verifications VARCHAR(500),
host_has_profile_pic BOOLEAN,
host_identity_verified BOOLEAN,
calculated_host_listings_count INT,
calculated_host_listings_count_entire_homes INT,
calculated_host_listings_count_private_rooms INT,
calculated_host_listings_count_shared_rooms INT,
PRIMARY KEY(host_id)
)"
Now create the empty table in our database.
dbSendQuery(con, query)
<SQLiteResult>
SQL CREATE TABLE host_info(
host_id INT,
host_url VARCHAR(50),
host_name VARCHAR(100),
host_since VARCHAR(50),
host_location VARCHAR(500),
host_about VARCHAR(10000),
host_response_time VARCHAR(50),
host_response_rate VARCHAR(50),
host_acceptance_rate VARCHAR(50),
host_is_superhost BOOLEAN,
host_thumbnail_url VARCHAR(500),
host_picture_url VARCHAR(500),
host_neighbourhood VARCHAR(50),
host_listings_count INT,
host_total_listings_count INT,
host_verifications VARCHAR(500),
host_has_profile_pic BOOLEAN,
host_identity_verified BOOLEAN,
calculated_host_listings_count INT,
calculated_host_listings_count_entire_homes INT,
calculated_host_listings_count_private_rooms INT,
calculated_host_listings_count_shared_rooms INT,
PRIMARY KEY(host_id)
)
ROWS Fetched: 0 [complete]
Changed: 0
Next, we check the schema of our database so far to ensure that it was loaded correctly.
res <- dbSendQuery(con, "PRAGMA table_info([host_info]);")
fetch(res)
cid name type
1 0 host_id INT
2 1 host_url VARCHAR(50)
3 2 host_name VARCHAR(100)
4 3 host_since VARCHAR(50)
5 4 host_location VARCHAR(500)
6 5 host_about VARCHAR(10000)
7 6 host_response_time VARCHAR(50)
8 7 host_response_rate VARCHAR(50)
9 8 host_acceptance_rate VARCHAR(50)
10 9 host_is_superhost BOOLEAN
11 10 host_thumbnail_url VARCHAR(500)
12 11 host_picture_url VARCHAR(500)
13 12 host_neighbourhood VARCHAR(50)
14 13 host_listings_count INT
15 14 host_total_listings_count INT
16 15 host_verifications VARCHAR(500)
17 16 host_has_profile_pic BOOLEAN
18 17 host_identity_verified BOOLEAN
19 18 calculated_host_listings_count INT
20 19 calculated_host_listings_count_entire_homes INT
21 20 calculated_host_listings_count_private_rooms INT
22 21 calculated_host_listings_count_shared_rooms INT
notnull dflt_value pk
1 0 NA 1
2 0 NA 0
3 0 NA 0
4 0 NA 0
5 0 NA 0
6 0 NA 0
7 0 NA 0
8 0 NA 0
9 0 NA 0
10 0 NA 0
11 0 NA 0
12 0 NA 0
13 0 NA 0
14 0 NA 0
15 0 NA 0
16 0 NA 0
17 0 NA 0
18 0 NA 0
19 0 NA 0
20 0 NA 0
21 0 NA 0
22 0 NA 0
dbClearResult(res)
Finally, insert our host_data table into the equivalent table in our RSQLite database using the insert_to_sql() function as defined above.
insert_to_sql(con, "host_info", host_data)
We can verify the contents of our newly created and populated RSQLite database table host_info by viewing the first 10 rows of each column as follows:
res <- dbSendQuery(con, "SELECT * FROM host_info LIMIT 10")
out_db <- fetch(res)
dbClearResult(res)
rmarkdown::paged_table(out_db)
We are now ready to move on to constructing our listing table. This table contains information related to the types of listings available and their relevant attributes. The process of constructing the listing table is very similar to the process of constructing the host_info table.
In order to prepare the relevant columns related to listing information we perform the following: 1. Remove all columns related to host_data. 2. Remove additional unnecessary columns. 3. Remove the dollar signs in the price column. 4. Perform the same transformation as we did above on the host_verifications column to the amenities column. 5. Convert all columns of the type date to the type character as is required for insertion into our database.
# (1) Remove host_data columns
listing_data <- data %>%
select( - names(host_data)[-1])
# (2) Remove extraneous columns
listing_data <- listing_data %>%
select(-c(license, calendar_updated, bathrooms, scrape_id))
# (3) Remove dollar signs from price column
listing_data <- listing_data %>%
mutate(price = str_remove_all(price, "[$,]") %>%
as.numeric()
)
# (4) Transform amenities and host verification column
listing_data <- listing_data %>%
mutate(amenities = str_remove_all(amenities, "[\"\\'\\[\\]]"))
# (5) Convert dates to character
listing_data <-
listing_data %>%
mutate(last_scraped = as.character(last_scraped),
calendar_last_scraped = as.character(calendar_last_scraped),
first_review = as.character(first_review),
last_review = as.character(last_review))
Just as before, create our listing database table initially as a string to be inserted as a query.
#################### Create listing table
query_2 <- [1856 chars quoted with '"']
Insert the listing table into our database.
dbSendQuery(con, query_2)
<SQLiteResult>
SQL CREATE TABLE listing (
id INT,
listing_url VARCHAR(100),
last_scraped VARCHAR(50),
name VARCHAR(500),
description VARCHAR(2000),
neighborhood_overview VARCHAR(2000),
neighbourhood VARCHAR(100),
neighbourhood_cleansed VARCHAR(100),
neighbourhood_group_cleansed VARCHAR(100),
latitude DECIMAL(25,18),
longitude DECIMAL(25, 18),
property_type VARCHAR(100),
room_type VARCHAR(100),
picture_url VARCHAR(500),
host_id INT,
accommodates INT,
bathrooms_text VARCHAR(100),
bedrooms INT,
beds INT,
amenities VARCHAR(2000),
price DECIMAL(15, 5),
minimum_nights INT,
maximum_nights INT,
minimum_minimum_nights INT,
maximum_minimum_nights INT,
minimum_maximum_nights INT,
maximum_maximum_nights INT,
minimum_nights_avg_ntm DECIMAL(16, 5),
maximum_nights_avg_ntm DECIMAL(16, 5),
has_availability BOOLEAN,
availability_30 INT,
availability_60 INT,
availability_90 INT,
availability_365 INT,
calendar_last_scraped VARCHAR(50),
number_of_reviews INT,
number_of_reviews_ltm INT,
number_of_reviews_l30d INT,
first_review VARCHAR(50),
last_review VARCHAR(50),
review_scores_rating DECIMAL(10, 5),
review_scores_accuracy DECIMAL(10, 5),
review_scores_cleanliness DECIMAL(10, 5),
review_scores_checkin DECIMAL(10, 5),
review_scores_communication DECIMAL(10, 5),
review_scores_location DECIMAL(10, 5),
review_scores_value DECIMAL(10, 5),
instant_bookable BOOLEAN,
reviews_per_month DECIMAL(10, 5),
PRIMARY KEY(id),
FOREIGN KEY(host_id) REFERENCES host_info(host_id)
)
ROWS Fetched: 0 [complete]
Changed: 0
Insert the listing data into our database using the same insert_to_sql() function as before.
insert_to_sql(con, "listing", listing_data)
Confirm that the data was indeed inserted correctly.
res <- dbSendQuery(con, "SELECT * FROM listing LIMIT 10")
out_db <- fetch(res)
dbClearResult(res)
rmarkdown::paged_table(out_db)
Finally, before exiting our program, disconnect from the database
dbDisconnect(con)
We have provided an example of how to pull real-world data from an actual source, then separate, clean and load the data into a RSQLite database. Real-world data is often times initially messy, but that does not mean it cannot be wrangled into a form from which we can gain deeper insights. In our next post, we demonstrate how to answer some interesting questions given our newly created database.